Abstract
This markdown script encapsulates the methodological backbone of my study of American dissertations in Chinese history. It relies on a suite of R packages: tidyverse for data manipulation, ggplot2 for visualization, stm and stminsights for structural topic modeling, ggraph for graph plotting, reshape for data restructuring, kableExtra for advanced table generation, and RColorBrewer and pals for color palettes. This workflow underpins the data processing, analysis, and presentation of findings. The script reflects a commitment to reproducibility and open scholarship, inviting fellow researchers to engage with, scrutinize, and build upon the foundations laid herein.
This study is based on data extracted from the ProQuest Dissertations & Theses database. The dataset contains the metadata and abstracts of 1,132 dissertations completed in American universities between 1985 and 2022. This script does not include the pre-processing workflow that transformed the original dataset into the ‘usdiss4’ dataset used here; the script I wrote for data cleaning and formatting (UShistdissPrep.R) is available in the uschinadiss project on GitHub.
This guide is addressed to historians with a basic knowledge of programming. We provide the code for the sake of traceability and reproducibility in research, but also to offer a workflow that other scholars can adapt for their own purposes. The display of code is optional: one can skip the code and focus on the results and basic analyses. For access to the dataset and other scripts, please refer to the uschinadiss project on GitHub.
This script guides the reader through the successive steps that I followed to process and analyze the data on American doctoral dissertations on China, from the production of basic statistical measures to the analysis, visualization, and interpretation of the data using various computational methodologies. Only elementary analyses are provided here. For a more systematic analysis, please refer to my working paper “Who owns China’s Past? American Universities and the Writing of Chinese History” on the PEERS platform.
In this Markdown script, we use a series of approaches to transform the metadata on American doctoral dissertations in Chinese history extracted from the ProQuest Dissertations & Theses platform, performing a complete chain of operations from basic statistical computing to mapping and topic modeling:

- Statistical analysis and basic visualization of the dataset
- Construction of spatial data and mapping
- Textual analysis of the keywords and abstracts
- Implementation of topic modeling
- Introduction of various topic modeling tools
My purpose is to study the set of doctoral dissertations produced in American universities as a central contribution to historical knowledge on China. From a dataset that contains mostly metadata about the dissertations, I develop a workflow that examines the metadata from various angles to uncover structures and patterns in the production of the dissertations and their content. Ultimately, I hope to highlight the trends in historical research about China.
Upload the US dissertations dataset
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)   # readr, dplyr, stringr, ggplot2
library(kableExtra)  # styled HTML tables
usdiss4 <- read_delim("usdiss4.csv", delim = ";",
escape_double = FALSE, col_types = cols(Period_Zh = col_skip()),
trim_ws = TRUE)
Check out the variables in the dataset
colnames(usdiss4)
## [1] "StoreId" "Author" "Nat" "Title"
## [5] "Period" "Abstract" "Year" "DegYear"
## [9] "Degree" "Country" "School_Name" "Department"
## [13] "Department_Strd" "Subjects" "Keywords" "Keywords_Ext"
Examine the first 3 rows of the dataset
head(usdiss4, 3) %>% # Select only the first 3 rows of the dataframe
kable("html", escape = FALSE) %>% # Create the kable table in HTML format
kable_styling(bootstrap_options = c("striped", # Add Bootstrap styling options
"hover",
"condensed"),
full_width = F, # Set to FALSE to avoid full width
position = "left") %>% # Position the table to the left
column_spec(1, width = "150px") %>% # Adjust the width of the first column (if needed)
scroll_box(width = "100%", height = "500px") # Add a scroll box if the table is too large
| StoreId | Author | Nat | Title | Period | Abstract | Year | DegYear | Degree | Country | School_Name | Department | Department_Strd | Subjects | Keywords | Keywords_Ext |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 304714093 | Kim, Haewon | Korea | Unnatural mountains: Meaning of Buddhist landscape in the Precious Rain bianxiang in Mogao Cave 321 | Pre-modern | This dissertation explores a new way of looking at landscape depiction in Buddhist painting during the Tang dynasty (618–907). The materials are landscape features that appear as the background of the sutra illustrations called “ bianxiang (transformation tableau)” in the Dunhuang Mogao Caves in northwestern China. They have long been subjected to the formalistic approach and linear historical perspective, and little attention has been paid to their symbolic meaning and function. This study attempts to show their iconological aspect as a means to make a substantial criticism on the monolithic presence of the modern view of landscape painting and its anachronistic imposition on pre-modern examples. This investigation is formulated as a case study on a painting dated as late seventh century, the Precious Rain bianxiang in Cave 321. The painting has the most elaborate landscape depiction among contemporary bianxiang and is associated with the dynamic historical events around the political empowerment of Wu Zetian (r. 684–705), the only female emperor in China’s history. The study includes careful observation on the formal and stylistic aspects of the painting and its landscape background, and relates them with the main themes and landscape references in the sūtra, along with the historical circumstances of the period, to draw religious and political meanings of landscape. My conclusion is that the landscape background in this painting played a most significant and effective role in conveying the religious and political messages of the painting. 
The title of the dissertation “Unnatural Mountains” refers to the two major points that this study is trying to make: (1) .mountain landscape in the painting is not a direct transcription of the world but rather a sign embedded with meanings created within its religious and political contexts (2) .illusory mountains in the painting is an imperial symbol of Wu Zetian that accords with her unique and unconventional political position as the only female emperor in China’s history. | 2001 | 2001 | Ph.D. | United States | University of Pennsylvania | NA | NA | Art History, History | Buddhist, China, Dunhuang Mogao Caves, Iconography, Landscape, Painting, Tang dynasty | Art History, History, Buddhist, China, Dunhuang Mogao Caves, Iconography, Landscape, Painting, Tang dynasty |
| 305447585 | Kim, Jaeyoon | Korea | The Red Turban Rebellions and the emergence of ethnic consciousness of the Hakkas in nineteenth-century China | Contemporary | My dissertation, The Red Turban Rebellions and the Emergence of Ethnic Consciousness of the Hakkas in Nineteenth-Century China, focuses on one of most important and controversial minorities in China—and a group that significantly shaped the country’s nineteenth and twentieth century history: the Hakka or guest people. Han Chinese who migrated from western Fujian to Guangdong province in search of new economic opportunities over the course of the eighteenth and nineteenth centuries, these guest people challenged the economic control of earlier settlers in these provinces and thereby sparked some of the most violent struggles of late Qing China. I examine, in particular, how the participation of the guest people in a series of struggles, the Red Turban Rebellions (1854-1856) and the Hakka-Punti War (1856-1867) in the Pearl River Delta areas of South China, helped create among these people a distinct sense of identity, a sharp sense of their own, different, Hakka, ethnicity. My study is designed to provide a detailed historical analysis of the construction of Hakka identity. I focus on the whole network of different interests and relationships that led to the Red Turban Rebellions and the Hakka-Punti War of the mid-nineteenth century: the long-standing economic conflicts over land use. the part played by local gentry and lineage organizations in Hakka-Punti feuds. the role that the state, and most particularly local governments, played in intensifying existing tensions and thus drawing ethnic lines. In short, in focusing intensively on one particular place and time, my work provides a full and rich picture of all the factors–economic, political, as well as social–that contributed to the definition of Hakka ethnicity. 
My dissertation thus helps us understand more precisely the complex process by which ethnicity is constructed. | 2005 | 2005 | Ph.D. | United States | University of Oregon | NA | NA | History, Minority & ethnic groups, Sociology | China, Ethnic consciousness, Hakkas, Nineteenth century, Red Turban Rebellions | History, Minority & ethnic groups, Sociology, China, Ethnic consciousness, Hakkas, Nineteenth century, Red Turban Rebellions |
| 2080000000 | Kim, Jaymin | Korea | Asymmetry and Elastic Sovereignty in the Qing Tributary World: Criminals and Refugees in Three Borderlands, 1630s-1840s | Modern | This dissertation analyzes how Qing China (1636-1912) and three of its tributary states (Chos?n Korea, Vietnam, Kokand) handled interstate refugees and criminals from the 1630s to the 1840s. I use Classical Chinese and Manchu memorials and diplomatic documents from Qing archives in Beijing and Taipei as well as Chinese, Korean, and Vietnamese published sources to construct a bilateral view of these interstate relations and compare them. My research reveals multiple, flexible, and shifting conceptions of boundaries, jurisdiction, and sovereignty. Boundaries between Qing and its tributaries were not absolute to a Qing court that claimed universal rule, and the court often erased them by adopting tributary refugees as Qing subjects or encroaching on tributary domains. Further, the Qing court often asserted jurisdiction over tributary subjects committing crimes on its soil or against its subjects. In contrast, no tributary court openly asserted jurisdiction over Qing subjects. Together, these cases reveal two defining characteristics of the Qing tributary order: asymmetry and elastic sovereignty. They show how the political norms of early modern Asia defy post-Westphalian norms of inter-state equality and non-interference in the internal affairs of fellow sovereign states. This work breaks new ground in Chinese history by highlighting Qing imperial projects outside today’s Chinese borders and by comparing borderlands in Northeast, Southeast, and Central Asia. It is also a work of world history that combines the connective method and the comparative method in a novel way, focusing on interactions across interstate boundaries in Asia while comparing these Asian borderlands with those in other early modern empires such as Russia and the Ottoman Empire. 
Lastly, my work engages with the field of international relations by reconstructing the contours of interstate affairs in early modern Asia before the introduction of public international law to the region, thus answering the recent call by scholars for a more inclusive, pluralistic view of international relations. | 2018 | 2018 | Ph.D. | United States | University of Michigan | History | History | Social sciences, Borderlands, Law, Qing, Sovereignty, Tributary system | NA | Social sciences, Borderlands, Law, Qing, Sovereignty, Tributary system, |
usdiss4_School <- usdiss4 %>% select(School_Name, Title) %>% group_by(School_Name) %>% count() %>%
arrange(desc(n))
usdiss4_School
### Number of dissertations by year
usdiss4_Year <- usdiss4 %>% select(Year, Title) %>% group_by(Year) %>% count()
usdiss4_Year
### Number of dissertations by degree
usdiss4_Deg <- usdiss4 %>% select(Degree, Title) %>% group_by(Degree) %>% count()
usdiss4_Deg
### Number of dissertations by period
usdiss4_Period <- usdiss4 %>% select(Period, Title) %>% group_by(Period) %>% count() %>%
arrange(desc(n))
usdiss4_Period
To assess more precisely the proportion of dissertations by historical period, I compute the sum of the ‘n’ column and create a new column with the percentage.
total_sum <- sum(usdiss4_Period$n)
usdiss4_Period$percentage <- (usdiss4_Period$n / total_sum) * 100
usdiss4_Period
usdiss4_Dpt <- usdiss4 %>% select(Department_Strd, Title) %>% group_by(Department_Strd) %>% count() %>%
arrange(desc(n))
usdiss4_Dpt
This result is not conclusive because the share of missing data is too large (841 missing values). We can, however, compute the contribution of history departments in absolute terms.
usdiss4_Hist <- usdiss4 %>% filter(str_detect(Department_Strd, "History"))
usdiss4_Hist
In the available data, “History” is mentioned 179 times.
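Because str_detect() returns NA for missing departments and filter() silently drops those rows, it can be useful to make the split between matched and missing departments explicit. A quick cross-check (a sketch, assuming the usdiss4 dataset loaded above):

```r
# Tally history departments and missing department values side by side;
# na.rm = TRUE ignores the NA results produced by missing Department_Strd.
usdiss4 %>%
  summarise(
    history_depts = sum(str_detect(Department_Strd, "History"), na.rm = TRUE),
    missing_depts = sum(is.na(Department_Strd))
  )
```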
Plot the number of dissertations per year for the whole data set
ggplot(data = usdiss4) +
geom_bar(mapping = aes(x = Year), fill="darkblue")+
labs(title = "Number of dissertations per year (1932-2022)",
subtitle = "Dissertations per year",
caption = "based on data extracted from ProQuest Dissertations",
x = "Year",
y = "Number of dissertations")
For a long period, the number of dissertations is insignificant, which produces an excessively spread-out chart. The gaps also reflect the incompleteness of the data in the ProQuest database. To obtain a more relevant visualization, I focus on the years with at least 10 dissertations per year.
usdiss4_Yearfil <- usdiss4_Year %>% filter(n>9)
Plot the number of dissertations per year for the selected dataset. In this visualization, I order the data from the year with the lowest number of dissertations to the year with the highest. This is done with the reorder(Year, n) argument in the script. This presentation highlights which years were more productive.
ggplot(usdiss4_Yearfil, aes(x = reorder(Year, n), y = n)) + geom_bar(stat = "identity", fill="palegreen4")+
labs(title = "Number of dissertations per year",
subtitle = "(1990-2022)",
caption = "based on data extracted from ProQuest Dissertations",
x = "Year",
y = "Number of dissertations")
In the script below, I revert to the visualization of the number of dissertations per year in chronological order, after selecting the sample of dissertations produced after 1990.
usdiss4_Samp <- usdiss4 %>% filter(Year > 1989)
ggplot(data = usdiss4_Samp) +
geom_bar(mapping = aes(x = Year), fill="darkblue")+
labs(title = "Number of dissertations per year (1990-2022)",
subtitle = "Dissertations per year",
caption = "based on data extracted from ProQuest Dissertations",
x = "Year",
y = "Number of dissertations")
This visualization provides a view of the ups and downs of the production of dissertations over time. It shows three major, though unequal, peaks around 1999, 2007-08, and 2018.
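The peak years identified above can be checked programmatically; a minimal sketch using the usdiss4_Year counts built earlier:

```r
# The five most productive years; ungroup() first because usdiss4_Year
# is still grouped by Year after count().
usdiss4_Year %>%
  ungroup() %>%
  slice_max(n, n = 5)
```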
We can also plot the number of dissertations per university. I choose a horizontal bar chart with the universities that produced the highest number of dissertations, in descending order.
ggplot(usdiss4_School, aes(x = reorder(School_Name, n), y = n)) + geom_bar(stat = "identity", fill="palegreen4")+
coord_flip() +
labs(title = "Number of dissertations per university",
subtitle = "(1985-2022)",
caption = "based on data extracted from ProQuest Dissertations",
x = "University",
y = "Number of dissertations")
Using the whole dataset produces too many bars for an effective visualization, so the data needs to be filtered. I opt for a minimum of 15 dissertations per university.
usdiss4_Schoolfil <- usdiss4_School %>% filter(n>14)
We can now plot the number of dissertations per university for
the selected sample.
ggplot(usdiss4_Schoolfil, aes(x = reorder(School_Name, n), y = n)) + geom_bar(stat = "identity", fill="palegreen4")+
coord_flip() +
labs(title = "Number of dissertations per university",
subtitle = "15 dissertations or more (1985-2022)",
caption = "based on data extracted from ProQuest Dissertations",
x = "University",
y = "Number of dissertations")
For a more refined analysis of the evolution of the number of
dissertations over time in each university, I create university-based
datasets.
# Harvard
usdiss4_Harvard <- usdiss4 %>% filter(str_detect(School_Name, "Harvard"))
# Stanford
usdiss4_Stanford <- usdiss4 %>% filter(str_detect(School_Name, "Stanford"))
# Princeton
usdiss4_Princeton <- usdiss4 %>% filter(str_detect(School_Name, "Princeton"))
# Chicago
usdiss4_Chicago <- usdiss4 %>% filter(str_detect(School_Name, "Chicago"))
# Columbia
usdiss4_Columbia <- usdiss4 %>% filter(str_detect(School_Name, "Columbia"))
# UCIrvine
usdiss4_UCIrvine <- usdiss4 %>% filter(str_detect(School_Name, "Irvine"))
# UCBerkeley
usdiss4_UCBerkeley <- usdiss4 %>% filter(str_detect(School_Name, "Berkeley"))
# Yale
usdiss4_Yale <- usdiss4 %>% filter(str_detect(School_Name, "Yale"))
# Michigan
usdiss4_Michigan<- usdiss4 %>% filter(str_detect(School_Name, "Michigan"))
Plot the number of dissertations per year at Columbia
University
ggplot(data = usdiss4_Columbia) +
geom_bar(mapping = aes(x = Year), fill="darkblue")+
labs(title = "Columbia dissertations per year (1960-2022)",
subtitle = "Dissertations per year",
caption = "based on data extracted from ProQuest Dissertations",
x = "Year",
y = "Number of dissertations")
Plot the number of dissertations per year at Harvard University
ggplot(data = usdiss4_Harvard) +
geom_bar(mapping = aes(x = Year), fill="darkblue")+
labs(title = "Harvard dissertations per year (1988-2022)",
subtitle = "Dissertations per year",
caption = "based on data extracted from ProQuest Dissertations",
x = "Year",
y = "Number of dissertations")
Lump together the universities outside the top 15 using the ‘forcats’ library. This library provides tools for working with factors, R’s data structure for categorical variables. In this script, the ‘fct_lump’ function lumps all levels of the School_Name factor that are not among the 15 most frequent into an “Other” level. This is helpful for simplifying factors with many rarely occurring levels. After lumping the less frequent school names into “Other”, the count() function from dplyr counts the occurrences of each level of School_Name.
library(forcats)
# usdiss4_School already holds one row per school with its count in 'n',
# so both the lumping and the recount must be weighted by 'n'
# (fct_lump_n with w, count with wt); otherwise every school appears
# just once and the top-15 selection is meaningless.
usdiss4_SchoolLump <- usdiss4_School %>%
  ungroup() %>%
  mutate(School_Name = fct_lump_n(School_Name, n = 15, w = n)) %>%
  count(School_Name, wt = n, sort = TRUE)
Another way to analyze the data on dissertations is to locate the ‘sites of production’ on a map of the United States. The original metadata contained only the name of the university, not its location. Mapping the ‘sites of production’ therefore required identifying the locations. This was done separately, as it implied numerous iterations to reconcile the university names in our dataset with the names of 1,749 American universities with city names initially found on UniRank. The script for processing the spatial data, including matching city names and states, is available on GitHub. The best source, however, is Opendatasoft, which provides a more complete list of 6,559 American universities with city names, states, and geocoordinates.
Upload the ‘US_universities_LocCoord’ file with the list of universities and geocoordinates. This is a file that I prepared for a more general purpose: it contains a list of 2,130 American universities, both past and present (including universities and colleges that no longer exist). The file includes a column with the Chinese names of universities when available, although this is not relevant for the present study.
Upload the file with the geocoordinates of universities
US_universities_LocCoord <- read_delim("US_universities_LocCoord.csv", delim = ",",
escape_double = FALSE, trim_ws = TRUE)
Display the content of the dataset (first 15 rows). It contains
the name of the universities, their location (city, state) and their
geocoordinates (latitude, longitude).
head(US_universities_LocCoord, 15) %>%
kable("html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("striped",
"hover",
"condensed"),
full_width = F,
position = "left") %>%
column_spec(1, width = "150px") %>%
scroll_box(width = "100%", height = "500px")
| School_Name | City | State | lat | lng | Country |
|---|---|---|---|---|---|
| American University | Washington D.C. | District of Columbia | 38.9047 | -77.0163 | United States |
| Catholic University of America | Washington D.C. | District of Columbia | 38.9047 | -77.0163 | United States |
| Northern State University | Aberdeen | South Dakota | 45.4649 | -98.4686 | United States |
| Presentation College | Aberdeen | South Dakota | 45.4649 | -98.4686 | United States |
| Abilene Christian University | Abilene | Texas | 32.4543 | -99.7384 | United States |
| Hardin-Simmons University | Abilene | Texas | 32.4543 | -99.7384 | United States |
| McMurry University | Abilene | Texas | 32.4543 | -99.7384 | United States |
| East Central University | Ada | Ohio | 40.7681 | -83.8251 | United States |
| Ohio Northern University | Ada | Ohio | 40.7681 | -83.8251 | United States |
| Chamberlain University | Addison | Texas | 32.9590 | -96.8355 | United States |
| Adrian College | Adrian | Michigan | 41.8994 | -84.0447 | United States |
| Siena Heights University | Adrian | Michigan | 41.8994 | -84.0447 | United States |
| University of South Carolina-Aiken | Aiken | South Carolina | 33.5303 | -81.7271 | United States |
| University of Akron | Akron | Ohio | 41.0798 | -81.5219 | United States |
| Adams State University | Alamosa | Colorado | 37.4752 | -105.8770 | United States |
The kableExtra package is designed to enhance the default knitr::kable() output for HTML and LaTeX tables. The functions above transform and style a data frame for output as an HTML table within an R Markdown document. The functions and their parameters:

- head(US_universities_LocCoord, 15): takes the US_universities_LocCoord data frame and slices the first 15 rows to be displayed.
- kable("html", escape = FALSE): creates a basic HTML table from the data frame. The escape = FALSE parameter tells kable not to escape HTML entities within the table. This is useful when you want to include HTML tags or special characters in the table cells that should be rendered as HTML.
- kable_styling(...): applies additional styling to the kable table:
  - bootstrap_options: applies Bootstrap classes to the table for additional styling. In this case, “striped” adds zebra-striping to the table rows, “hover” enables a hover state on the rows, and “condensed” makes the table more compact by cutting cell padding in half.
  - full_width = F: sets the table width. If FALSE, the table width will be set to the minimum width required to display the content without horizontal scrolling.
  - position = "left": aligns the table to the left of the container.
- column_spec(1, width = "150px"): styles the first column (1) of the table. The width = "150px" parameter sets the width of this column to 150 pixels.
- scroll_box(width = "100%", height = "500px"): puts the table inside a scrollable box. The width = "100%" parameter ensures the box spans the entire width of the container, while height = "500px" sets the box height to 500 pixels. If the table content exceeds these dimensions, scroll bars will appear to navigate through the table.

The output of this code is an HTML table with the first 15 rows of US_universities_LocCoord, styled with Bootstrap classes and contained within a scrollable box that allows users to scroll through the table if it exceeds the specified dimensions.
To map the universities that produced dissertations, I proceed
in two steps. First, I join the file with the list of universities
prepared previously (usdiss4_School) and the geolocation file uploaded
above (US_universities_LocCoord). This adds the spatial geocoordinates
to each university name.
usdiss4_SchoolLoc <- left_join(usdiss4_School, US_universities_LocCoord)
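Before using the joined table, it is worth checking which university names failed to match the geolocation file; unmatched names come back with NA coordinates. A quick diagnostic (a sketch):

```r
# Schools without coordinates after the join, largest producers first;
# these names would need to be harmonized with US_universities_LocCoord.
usdiss4_SchoolLoc %>%
  filter(is.na(lat)) %>%
  arrange(desc(n))
```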
Second, I add the locations to the original ‘usdiss4’ file
through a join based on the names of universities.
usdiss4Loc <- left_join(usdiss4, usdiss4_SchoolLoc)
Before mapping, I want to examine the distribution of dissertations by city. I choose a horizontal bar chart with decreasing values to represent the distribution.
# Aggregate to one row per city first: usdiss4Loc has one row per
# dissertation, each carrying its school's total 'n', so plotting it
# directly with stat = "identity" would stack and inflate the counts.
usdiss4_City <- usdiss4_SchoolLoc %>%
  group_by(City) %>%
  summarise(n = sum(n))
ggplot(usdiss4_City, aes(x = reorder(City, n), y = n)) + geom_bar(stat = "identity", fill="palegreen4")+
coord_flip() +
labs(title = "Number of dissertations per city",
subtitle = "(1988-2022)",
caption = "based on data extracted from ProQuest Dissertations",
x = "City",
y = "Number of dissertations")
The plot above is not very satisfying because it is too crowded due
to the high number of cities in the results. To obtain a more readable
visualization I filter the cities with at least 15 dissertations.
usdiss4Loc2 <- usdiss4_SchoolLoc %>%
  group_by(City) %>%
  summarise(n = sum(n)) %>%  # one row per city to avoid double-counting
  filter(n > 14)
I plot the selected sample of dissertations per city
ggplot(usdiss4Loc2, aes(x = reorder(City, n), y = n)) + geom_bar(stat = "identity", fill="darkblue")+
coord_flip() +
labs(title = "Number of dissertations per city",
subtitle = "15 dissertations or more",
caption = "based on data extracted from ProQuest Dissertations",
x = "City",
y = "Number of dissertations")
To map the universities and their production, I use the ‘leaflet’ library. The ‘leaflet’ library is a powerful and flexible way to create interactive maps. With leaflet, one can create maps that users can zoom in and out of, pan across, and click on to reveal more information.
library(leaflet)
library(readxl)
Initially, I mapped all the universities, but because Hawaii is located in the middle of the Pacific, it distorts the default map visualization that I want to obtain. In this script, I remove “Hawaii” from the dataset.
usdiss4LocUSA <- usdiss4Loc %>% filter(!str_detect(State, "Hawaii"))
write_csv(usdiss4LocUSA, "usdiss4LocUSA.csv")
us_uni <- usdiss4LocUSA
The distribution of universities that produced dissertations is presented in three successive maps. The first one below shows the universities (in fact the cities where they are located) represented by simple circles. Only the location is represented here. This map gives a preliminary view of the spatial distribution of universities and provides some clues about distribution patterns. We can improve this visualization.
leaflet(data = us_uni) %>%
addTiles() %>%
addCircleMarkers(~lng, ~lat, popup = ~School_Name)
The second map below shows the universities (in fact the cities where they are located) represented by circles customized in color and size for better readability. Only the location is represented here. On this map, I change the symbol opacity and reduce its size to better see the actual locations. The darker green color indicates the number of dissertations. Yet relying solely on color shade does not convey a clear sense of the relative importance of each university. In the next map, I propose a different visualization.
leaflet(data = us_uni) %>%
addTiles() %>%
addCircleMarkers(~lng, ~lat, radius = ~n/50,
popup = ~paste(School_Name, ":", n, "dissertations"),
fill = TRUE, fillOpacity = 0.5, color = "green")
The third map below shows the universities (in fact the cities where they are located) represented by circles that are proportional to the number of dissertations produced in each university. This map retains the color shade code, but the size of the points is proportional to the number of dissertations.
leaflet(data = us_uni) %>%
addTiles() %>%
addCircleMarkers(~lng, ~lat, radius = ~sqrt(n),
popup = ~paste(School_Name, ":", n, "dissertations"),
fill = TRUE, fillOpacity = 0.5, color = "green")
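One possible refinement, not part of the original workflow: key the circle color to the number of dissertations with a continuous palette and add a legend, so that the color shading itself becomes readable. A sketch using leaflet's colorNumeric():

```r
# Map the dissertation counts to a green palette and document it in a legend.
pal <- colorNumeric(palette = "Greens", domain = us_uni$n)

leaflet(data = us_uni) %>%
  addTiles() %>%
  addCircleMarkers(~lng, ~lat, radius = ~sqrt(n),
                   popup = ~paste(School_Name, ":", n, "dissertations"),
                   fill = TRUE, fillOpacity = 0.7, color = ~pal(n)) %>%
  addLegend(pal = pal, values = ~n, title = "Dissertations")
```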
The map highlights the relative level of concentration of Chinese historical studies in a given location or region. We can see very clearly three main clusters of different sizes. The densest cluster is located on the east coast along a Washington D.C.-Cambridge axis. The second most important cluster can be found in California, with two sub-clusters around Berkeley-Stanford and Los Angeles. A less compact cluster can also be seen in the Great Lakes area, with Chicago at its center.
The purpose of this section is to explore the textual content of the dissertations. The metadata available for analysis includes the keywords and the abstracts. The first step is to categorize the dissertations using the keywords to get a sense of how the authors defined their work. The second step aims to uncover trends in research through a topic-modeling approach applied to the abstracts. For topic modeling, I use a combination of libraries: the main library for text analysis is ‘stm’, combined with ‘stminsights’ for interactive visualization.
Some dissertations came without an abstract. Since I need a set without void content, I remove the dissertations that have no abstract.
usdiss4tk <- usdiss4 %>% filter(!str_detect(Abstract, "Abstract not available"))
I remove all the quotation marks in the abstracts, as these characters can interfere with the code: quotation marks delimit strings in R, so it is best to avoid them in the text data to be processed.
# Overwrite the Abstract column (rather than creating a new one) so that
# the cleaned text is the version passed to textProcessor below.
usdiss4tk <- usdiss4tk %>% mutate(Abstract = str_remove_all(Abstract, "\""))
The data is pre-processed to remove all the stop words from the text. I use a custom English list that I enriched with the most frequent and repetitive terms found in most abstracts (examine, chapter, dissertation, etc.).
usdiss4tkt <- usdiss4tk %>% select(StoreId, Abstract, Title, Year, School_Name, Keywords_Ext)
meta <- usdiss4tkt %>% transmute(StoreId, Title, Year, School_Name, Keywords_Ext)
corpus <- stm::textProcessor(usdiss4tk$Abstract,
metadata = meta,
stem = FALSE,
wordLengths = c(4, Inf),
customstopwords = c("part", "among", "many", "within", "study", "used", "well", "explain", "however", "china", "toward", "chinas", "china's", "chinese", "dissertation", "chapter", "chapters", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "also", "argue", "even", "rather", "examine", "examines", "argues", "explores", "thus"))
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Remove Custom Stopwords...
## Removing numbers...
## Creating Output...
Topic modeling relies on word frequencies and co-occurrence across the documents. It does not make sense to use all the terms present in the text if they do not appear frequently. It is possible to adjust the threshold under which the less frequent terms will be removed from the dataset. In the script below, the threshold is set at 10: all terms that appear in fewer than 10 documents will be removed from the analysis. The stm library reports the result of the trimming in the console. Usually, in my R scripts I copy-paste these results into the script to keep track of the effect of data processing.
The function returns an object ‘out’ (for ‘output’) that includes the document-term matrix (after thresholding), the reduced vocabulary, and the metadata, which can then be used to fit a topic model with the stm function. The result of stm::prepDocuments is a cleaner and more manageable set of data that will likely yield better results when passed to a topic modeling algorithm, because it filters out noise and focuses on the more significant terms.
out <- stm::prepDocuments(corpus$documents, corpus$vocab, corpus$meta, lower.thresh = 10)
## Removing 18028 of 20669 terms (39881 of 141981 tokens) due to frequency
## Your corpus now has 1109 documents, 2641 terms and 102100 tokens.
Removing 18029 of 20671 terms (39883 of 142122 tokens) due to frequency Your corpus now has 1109 documents, 2642 terms and 102239 tokens.
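Rather than fixing the threshold blindly, one can inspect how many documents, words, and tokens different thresholds would remove, using stm's built-in diagnostic. A minimal sketch, assuming the ‘corpus’ object created above:

```r
# Plot the number of documents, words, and tokens that would be removed
# at various lower.thresh values, to inform the choice of threshold
stm::plotRemoved(corpus$documents, lower.thresh = seq(1, 50, by = 5))
```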
The number of topics is not something that can be defined arbitrarily. One needs to assess what number of topics will match the data. The stm::searchK function is used to determine the optimal number of topics for the topic model. This function evaluates models with different numbers of topics to see which number provides the best fit for the data. After calculating K, stm::searchK produces four metrics that can be examined in a plot.
In our case, Ksearch pointed to an optimal model at 6 or 7 topics, with the 6-topic model presenting the best parameters. I choose to calculate only three models for the sake of comparison: the two optimal models (6 and 7 topics) and a larger model (10 topics) that may provide more granularity.
Ksearch <- stm::searchK(out$documents, out$vocab, c(5, 6, 7, 10), cores = 1, verbose = FALSE)
plot(Ksearch)
The K search graphs (‘Diagnostic value by number of topics’) help determine the best number of topics for any topic modeling exercise. Choosing the right number of topics is crucial in topic modeling, as it can influence the interpretability and usefulness of the generated topics.
Held-out likelihood: this metric gives an idea of how well the model predicts unseen data. In the context of topic modeling, it often refers to a method where a portion of each document is “held out”, i.e. not shown to the model during training. After training, the model tries to predict the held-out words, and the likelihood of the actual held-out words under the model is computed. A higher held-out likelihood indicates a better fit to the unseen data. However, be cautious: a model that fits the training data too closely might overfit and not generalize well to new, unseen documents.
Residuals are the differences between the observed values and the values predicted by the model. In topic modeling, residuals refer to the difference between the observed word distributions in documents and the word distributions predicted by the model. Smaller residuals indicate that the model’s predictions are closer to the observed data. However, as with held-out likelihood, residuals that are too small might indicate overfitting.
Semantic coherence measures how topically coherent the words within each topic are; in other words, it gauges how semantically similar the top words in a topic are to each other. A high semantic coherence generally suggests that the words within a topic make sense together and that the topic is interpretable and meaningful. Models with higher semantic coherence are often preferred, as they tend to produce more interpretable topics.
Lower bound: in Bayesian topic modeling, the true likelihood of the data given the model is often intractable to compute directly. Instead, algorithms optimize a lower bound on this likelihood. Tracking the lower bound gives insight into how well the model is fitting the data: a higher lower bound indicates a better fit. However, as always, be cautious about overfitting.
When selecting the number of topics, it is important to consider all these metrics together rather than relying on just one. There is often a trade-off: models with more topics might fit the data better (higher likelihood or lower residuals) but produce less coherent topics (lower semantic coherence). The ideal is to find a balance where the topics are both interpretable and provide a good fit to the data.
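The same diagnostics can also be read as a table rather than a plot. A sketch, assuming the Ksearch object computed above; in recent versions of stm the columns of Ksearch$results are list-columns, hence the unlist step (in older versions they may already be numeric):

```r
# Turn the searchK diagnostics into a plain numeric table for a
# side-by-side comparison of the candidate values of K
Ksearch$results %>%
  as_tibble() %>%
  mutate(across(everything(), ~ as.numeric(unlist(.x))))
```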
Building the model with 6 topics
mod.6 <- stm::stm(out$documents, out$vocab, K=6, prevalence =~ School_Name + Year, data=out$meta, verbose = FALSE)
Building the model with 7 topics
mod.7 <- stm::stm(out$documents, out$vocab, K=7, prevalence =~ School_Name + Year, data=out$meta, verbose = FALSE)
Building the model with 10 topics
mod.10 <- stm::stm(out$documents, out$vocab, K=10, prevalence =~ School_Name + Year, data=out$meta, verbose = FALSE)
The dataset includes temporal data (Year), which may be useful for an examination of evolution over time. We estimate the covariate effects for year and university (school).
effect_6 <- stm::estimateEffect(1:6 ~ School_Name + Year, mod.6, meta=out$meta)
effect_7 <- stm::estimateEffect(1:7 ~ School_Name + Year, mod.7, meta=out$meta)
effect_10 <- stm::estimateEffect(1:10 ~ School_Name + Year, mod.10, meta=out$meta)
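The estimated effects can then be plotted directly. A minimal sketch for one topic (assuming Year is stored as a numeric variable in the metadata, as required by method = "continuous"):

```r
# Expected proportion of Topic 1 in the 6-topic model as a function of Year
plot(effect_6, covariate = "Year", topics = 1,
     model = mod.6, method = "continuous",
     xlab = "Year",
     main = "Expected topic proportion over time (Topic 1, 6-topic model)")
```

Alternatively, stminsights provides get_effects() to extract the same estimates as a tidy data frame for custom ggplot2 plotting.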
This section displays the proportion of topics for each document, along with their metadata. These metrics are a crucial piece of information for determining how the topics are represented across the whole corpus and in individual documents. They can be used to identify the documents with the highest proportion of a given topic and to select representative documents for assessing and labeling the topics built by the model. The topics are never defined by the model: the model only provides a list of terms that embody the nature of the topic, and it is left to the researcher to qualify each topic with a concise label.
topicprop6<-make.dt(mod.6, meta)
topicprop6
topicprop7<-make.dt(mod.7, meta)
topicprop7
topicprop10<-make.dt(mod.10, meta)
topicprop10
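To select representative documents for labeling, stm's findThoughts function returns the documents with the highest proportion of a given topic. A minimal sketch, here using dissertation titles as the display text (assuming the metadata kept by prepDocuments):

```r
# The three documents most strongly associated with Topic 4 in the 6-topic model
stm::findThoughts(mod.6, texts = out$meta$Title, topics = 4, n = 3)
```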
Visualize the distribution of document-topic proportions. In topic modeling, each document is assumed to be a mixture of various topics. The document-topic proportion is a measure of how much each topic is represented in a given document. For example, a document-topic proportion could tell you that a particular document consists of 30% Topic A, 20% Topic B, and so on. Maximum A Posteriori (MAP) estimation is a statistical estimate of an unknown quantity (in this case, the document-topic proportions) that equals the mode of the posterior distribution. The MAP estimate gives you the most likely value of the proportion of each topic in a document after observing the data. When the MAP estimates of document-topic proportions are aggregated across all documents, one gets a distribution that describes the variability and central tendency of topic prevalence across the corpus. In practice, the distribution of MAP estimates of document-topic proportions provides insight into which topics are most prevalent in the corpus, how topics are mixed within documents, and potentially how documents relate to one another based on their topic composition.
Visualizing the distribution of document-topic proportions for the 6-topic model
plot.STM(mod.6, "hist")
This graph shows that in many cases a given topic is not represented in the documents (left-most bar). We can also see that most topics are present in the same proportions with the same distribution, except Topic 4, which has a higher representation in a more concentrated set of documents.
Visualizing the distribution of document-topic proportions for the 7-topic model
plot.STM(mod.7, "hist")
Visualizing the distribution of document-topic proportions for the 10-topic model
plot.STM(mod.10, "hist")
In this section we compute and visualize the topic distribution per document. The tidy function returns a tidy data frame where each variable is in a column, each observation is in a row, and each type of observational unit forms a table. The theta matrix contains the document-topic proportions. It shows the relationship between the documents and the topics, indicating how much each document pertains to each topic.
td_theta6 <- tidytext::tidy(mod.6, matrix = "theta")
td_theta7 <- tidytext::tidy(mod.7, matrix = "theta")
td_theta10 <- tidytext::tidy(mod.10, matrix = "theta")
It is not possible to visualize topic proportions for all documents at once. We proceed by steps, starting with the first 15 documents in the different models. Be careful to select a sensible interval: attempting to plot a very large set of documents at once might crash the kernel.
selectiontdthteta6<-td_theta6[td_theta6$document%in%c(1:15),]
selectiontdthteta7<-td_theta7[td_theta7$document%in%c(1:15),]
selectiontdthteta10<-td_theta10[td_theta10$document%in%c(1:15),]
Visualizing topic proportions for the first 15 documents in the 6-topic model
thetaplot6<-ggplot(selectiontdthteta6, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ document, ncol = 3) +
labs(title = "Theta values per document (first 15 documents)",
y = expression(theta), x = "Topic")
thetaplot6
Visualizing topic proportions for the first 15 documents in the 7-topic model
thetaplot7<-ggplot(selectiontdthteta7, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ document, ncol = 3) +
labs(title = "Theta values per document (first 15 documents)",
y = expression(theta), x = "Topic")
thetaplot7
The distribution of topics in the documents is very uneven. A topic can be highly represented in a document, or a document can relate to several topics. In the graph above, we can see that in the first 15 documents of the 7-topic model, Topic 7 is highly represented in document 3, while Topic 4 is equally highly represented in documents 8 and 13 and, to a lesser degree, in documents 4 and 6. Conversely, document 1 contains several topics at almost the same level of importance. These values provide a measure of how the algorithm has calculated the distribution of topics across all documents and for the corpus as a whole.
We can also select the last 15 documents in the same way.
selectiontdthteta6l<-td_theta6[td_theta6$document%in%c(1095:1109),] # last 15 of the 1109 documents
Visualizing topic proportions for the last 15 documents in the 6-topic model
thetaplot6l<-ggplot(selectiontdthteta6l, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ document, ncol = 3) +
labs(title = "Theta values per document (bottom list)",
y = expression(theta), x = "Topic")
thetaplot6l
Next, we want to understand more about each topic: what are they really about? If we go back to the β matrix, we can take a more analytical look at the word frequencies per topic. The matrix stores the log of the word probabilities for each topic, and plotting it can give us a good overall understanding of the distribution of words per topic.
In the script below, we compute and visualize the word frequencies for all topics in the 6-, 7-, and 10-topic models.
td_beta6 <- tidytext::tidy(mod.6)
options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100)
td_beta6 %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities for each topic (6 topics)",
subtitle = "Different words are associated with different topics")
This graph displays the ten most frequent words associated with each topic in the 6-topic model. These words can be used to define the nature of the topic and to give it a preliminary label. This list is of course very short and based on a single mode of computing (word frequency). The graphs below provide the same information for the 7- and 10-topic models.
td_beta7 <- tidytext::tidy(mod.7)
options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100)
td_beta7 %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities for each topic (7 topics)",
subtitle = "Different words are associated with different topics")
td_beta <- tidytext::tidy(mod.10)
options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100)
td_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities for each topic (10 topics)",
subtitle = "Different words are associated with different topics")
Since the graphs above provide only a limited list of terms, it is useful to have a more detailed look at the word distribution within each topic. In the graphs below, we examine a more detailed list of the words associated with each topic in the 6-topic model.
Topic 1
beta6T1<-td_beta6 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 1") #beta values for topic 1
beta6plotT1<-ggplot(beta6T1[beta6T1$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 1") #plot word probabilities higher than 0.003 for topic 1
beta6plotT1
Topic 2
beta6T2<-td_beta6 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 2") #beta values for topic 2
beta6plotT2<-ggplot(beta6T2[beta6T2$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 2")
beta6plotT2
Topic 3
beta6T3<-td_beta6 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 3") #beta values for topic 3
beta6plotT3<-ggplot(beta6T3[beta6T3$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 3")
beta6plotT3
Topic 4
beta6T4<-td_beta6 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 4") #beta values for topic 4
beta6plotT4<-ggplot(beta6T4[beta6T4$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 4")
beta6plotT4
Topic 5
beta6T5<-td_beta6 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 5") #beta values for topic 5
beta6plotT5<-ggplot(beta6T5[beta6T5$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 5")
beta6plotT5
Topic 6
beta6T6<-td_beta6 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 6") #beta values for topic 6
beta6plotT6<-ggplot(beta6T6[beta6T6$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 6")
beta6plotT6
We repeat the visualization for the topics in the 7-topic model.
Topic 1 in mod.7
beta7T1<-td_beta7 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 1") #beta values for topic 1
beta7plotT1<-ggplot(beta7T1[beta7T1$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 1")
beta7plotT1
Topic 2 in mod.7
beta7T2<-td_beta7 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 2") #beta values for topic 2
beta7plotT2<-ggplot(beta7T2[beta7T2$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 2")
beta7plotT2
Topic 3 in mod.7
beta7T3<-td_beta7 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 3") #beta values for topic 3
beta7plotT3<-ggplot(beta7T3[beta7T3$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 3")
beta7plotT3
Topic 4 in mod.7
beta7T4<-td_beta7 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 4") #beta values for topic 4
beta7plotT4<-ggplot(beta7T4[beta7T4$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 4")
beta7plotT4
Topic 5 in mod.7
beta7T5<-td_beta7 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 5") #beta values for topic 5
beta7plotT5<-ggplot(beta7T5[beta7T5$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 5")
beta7plotT5
Topic 6 in mod.7
beta7T6<-td_beta7 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 6") #beta values for topic 6
beta7plotT6<-ggplot(beta7T6[beta7T6$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 6")
beta7plotT6
Topic 7 in mod.7
beta7T7<-td_beta7 %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
filter(topic=="Topic 7") #beta values for topic 7
beta7plotT7<-ggplot(beta7T7[beta7T7$beta>0.003,], aes(term, beta, fill = as.factor(topic))) +
geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta),
title = "Word probabilities for Topic 7")
beta7plotT7
The visualizations above make it possible to refine the nature of the topics, as well as to see the frequency of the terms that contribute to each topic. If we take the example of the graph above, we can see that the dominant terms are Japanese, Taiwan, economic, and state. This is already a pretty good indication of the themes covered by the dissertations linked to this topic. As we go down the list, the additional terms reinforce the sense of which spaces are concerned and how present the economic dimension is.
We can explore alternative modes of data display, such as the plot.STM function with the “summary” argument, which visualizes in table format the topic distribution (which topics are overall more common) along with the most common words for each topic.
plot.STM(mod.6, "summary", n=5) # distribution and top 5 words per topic
plot.STM(mod.7, "summary", n=5) # distribution and top 5 words per topic
plot.STM(mod.10, "summary", n=5) # distribution and top 5 words per topic
In the word frequency graphs above, the visualization of the words was based on a single mode of computing. With the labelTopics (or sageLabels) function, we can obtain more detailed insight into the most frequent words in each topic through four modes of computing: highest probability (the default), FREX words (FREX weights words by frequency and exclusivity to the topic), lift words (frequency divided by frequency in other topics), and score (similar to lift, but with log frequencies). The most frequent words in each topic will appear in the console.
labelTopics(mod.6, n=10) # complete list of top 10 words per topic
## Topic 1 Top Words:
## Highest Prob: modern, history, western, cultural, political, century, world, intellectual, early, twentieth
## FREX: science, intellectual, ideas, scientific, western, twentieth, thought, intellectuals, knowledge, confucianism
## Lift: analyzing, philosophy, scientific, science, jesuits, journals, historian, confucianism, university, thinkers
## Score: analyzing, intellectual, science, global, scientific, modernity, confucianism, intellectuals, philosophy, revolution
## Topic 2 Top Words:
## Highest Prob: economic, state, development, local, economy, social, market, production, system, rural
## FREX: industry, kong, hong, economy, labor, industrial, market, peasant, peasants, workers
## Lift: censorship, industry, kong, size, cost, peasant, subsistence, manufacturing, industries, commodities
## Score: censorship, kong, rural, industrial, industry, hong, economy, cinema, workers, peasant
## Topic 3 Top Words:
## Highest Prob: women, social, local, political, society, government, movement, state, medical, education
## FREX: women, christian, medical, church, health, missionaries, medicine, missionary, party, women’s
## Lift: catholic, churches, church, women’s, converts, christians, health, care, christian, commoners
## Score: catholic, women, christian, medical, church, womens, health, medicine, women’s, missionary
## Topic 4 Top Words:
## Highest Prob: buddhist, song, ming, imperial, religious, period, dynasty, early, painting, literati
## FREX: buddhist, painting, buddhism, ritual, tang, song, yuan, text, medieval, paintings
## Lift: annotated, cave, eleventh, iconography, medieval, mortuary, painting, royal, shang, temples
## Score: buddhist, painting, paintings, song, mortuary, buddhism, ming, tang, ritual, literati
## Topic 5 Top Words:
## Highest Prob: cultural, social, shanghai, identity, culture, historical, political, literary, taiwanese, japanese
## FREX: taiwanese, shanghai, fiction, film, writers, literary, identity, identities, opera, literature
## Lift: memories, abstract, novels, fiction, opera, drama, exhibition, entertainment, memory, theater
## Score: abstract, taiwanese, film, literary, shanghai, fiction, urban, opera, music, theatrical
## Topic 6 Top Words:
## Highest Prob: qing, japanese, state, relations, empire, military, states, asia, political, imperial
## FREX: asia, frontier, empire, opium, asian, korea, diplomatic, military, korean, east
## Lift: border, borderland, diplomacy, germany, maritime, mongolia, policymakers, borderlands, diplomatic, frontiers
## Score: vietnam, asia, tibetan, trade, japanese, frontier, manchuria, empire, opium, manchu
labelTopics(mod.7, n=10) # complete list of top 10 words per topic
## Topic 1 Top Words:
## Highest Prob: modern, history, western, century, world, cultural, national, knowledge, early, twentieth
## FREX: science, intellectuals, scientific, modernity, modern, twentieth, global, knowledge, confucianism, ideas
## Lift: analyzing, science, scientific, journals, confucianism, jesuits, understandings, intellectuals, essence, modernity
## Score: analyzing, science, global, modernity, scientific, modern, intellectual, intellectuals, twentieth, confucianism
## Topic 2 Top Words:
## Highest Prob: political, social, cultural, urban, communist, city, party, revolution, culture, history
## FREX: party, hong, urban, socialist, communist, kong, city, violence, soviet, film
## Lift: censorship, cinema, film, films, kong, hong, migrants, zedong, fashion, opera
## Score: censorship, film, communist, kong, socialist, cinema, hong, urban, films, soviet
## Topic 3 Top Words:
## Highest Prob: women, social, local, medical, society, education, womens, family, christian, missionaries
## FREX: christian, women, medical, missionaries, womens, health, church, missionary, medicine, women’s
## Lift: catholic, christian, church, churches, health, women’s, care, christians, converts, anti-christian
## Score: catholic, women, christian, medical, womens, church, health, missionary, women’s, medicine
## Topic 4 Top Words:
## Highest Prob: buddhist, song, religious, dynasty, ritual, tang, period, imperial, early, culture
## FREX: buddhist, buddhism, tang, ritual, song, yuan, medieval, zhou, shang, inscriptions
## Lift: cave, cult, tombs, buddhism, buddhist, daoist, eleventh, medieval, mortuary, royal
## Score: buddhist, song, mortuary, buddhism, tang, ritual, shang, inscriptions, cave, daoist
## Topic 5 Top Words:
## Highest Prob: literary, cultural, political, historical, literature, history, painting, late, social, ming
## FREX: literary, reading, wang, fiction, artists, genre, works, literature, painting, writers
## Lift: abstract, fiction, novels, literary, genre, genres, reading, painter, poetic, authentic
## Score: abstract, painting, literary, paintings, fiction, poetry, literati, ming, artists, texts
## Topic 6 Top Words:
## Highest Prob: qing, state, imperial, relations, military, empire, political, power, century, asia
## FREX: frontier, empire, qing, manchu, opium, military, tibetan, border, asia, southeast
## Lift: border, diplomacy, mongolia, borderlands, maritime, sino-american, tributary, vietnam, xinjiang, diplomats
## Score: vietnam, tibetan, qing, trade, frontier, manchu, opium, xinjiang, empire, ming
## Topic 7 Top Words:
## Highest Prob: japanese, taiwan, economic, state, development, government, colonial, economy, japan, taiwanese
## FREX: taiwanese, manchuria, taiwan, japanese, economy, colonial, japans, taiwans, industrial, manchukuo
## Lift: developmental, industrialization, manchukuo, cost, firms, kai-shek, japans, manchuria, taiwanese, islands
## Score: developmental, taiwan, japanese, taiwanese, manchukuo, manchuria, colonial, taiwans, industrial, japans
labelTopics(mod.10, n=10) # complete list of top 10 words per topic
## Topic 1 Top Words:
## Highest Prob: modern, history, century, intellectual, western, knowledge, cultural, early, twentieth, world
## FREX: intellectual, science, knowledge, intellectuals, modern, twentieth, scientific, modernity, learning, fourth
## Lift: analyzing, science, philosophy, scientific, learning, thinkers, confucianism, intellectuals, intellectual, yang
## Score: analyzing, science, intellectual, scientific, modernity, modern, intellectuals, global, confucianism, twentieth
## Topic 2 Top Words:
## Highest Prob: social, cultural, urban, city, culture, socialist, shanghai, history, local, identity
## FREX: city, urban, socialist, film, media, films, opera, identities, hong, cinema
## Lift: cinema, censorship, film, films, citys, migrants, opera, socialist, city, fashion
## Score: censorship, film, urban, cinema, socialist, films, kong, opera, hong, city
## Topic 3 Top Words:
## Highest Prob: women, social, medical, education, womens, gender, christian, female, missionaries, family
## FREX: women, christian, medical, health, missionary, womens, church, female, women’s, medicine
## Lift: catholic, christian, churches, women’s, care, christians, converts, health, women, missionary
## Score: catholic, women, womens, christian, medical, church, missionary, health, women’s, christianity
## Topic 4 Top Words:
## Highest Prob: buddhist, religious, ritual, practices, buddhism, religion, zhou, early, culture, material
## FREX: buddhism, buddhist, ritual, zhou, shang, religion, religious, medieval, rites, monks
## Lift: mortuary, shang, tombs, buddhism, rites, monks, royal, cosmology, bronze, lineages
## Score: buddhist, mortuary, buddhism, shang, ritual, religious, medieval, tombs, rites, zhou
## Topic 5 Top Words:
## Highest Prob: literary, literature, cultural, historical, texts, late, political, works, history, social
## FREX: literary, literature, reading, fiction, writers, works, poetry, abstract, music, genre
## Lift: abstract, fiction, novels, poetic, literary, reading, readers, print, writers, description
## Score: abstract, literary, fiction, poetry, texts, literati, music, genre, writers, novels
## Topic 6 Top Words:
## Highest Prob: relations, asia, foreign, states, international, east, american, world, asian, policy
## FREX: asia, opium, east, korean, diplomatic, asian, international, british, united, relations
## Lift: diplomacy, german, opium, sino-american, vietnam, sino-soviet, diplomatic, germany, overseas, diplomats
## Score: vietnam, american, asia, opium, international, trade, diplomatic, soviet, german, sino-american
## Topic 7 Top Words:
## Highest Prob: japanese, taiwan, economic, colonial, development, state, rural, taiwanese, economy, japan
## FREX: taiwanese, colonial, taiwan, manchuria, industrial, taiwans, japanese, manchukuo, peasant, labor
## Lift: developmental, manchukuo, industrialization, taiwanese, peasant, factory, taiwans, manchuria, industrial, farmers
## Score: developmental, taiwan, japanese, taiwanese, colonial, manchukuo, manchuria, rural, taiwans, industrial
## Topic 8 Top Words:
## Highest Prob: qing, state, local, imperial, century, legal, officials, system, empire, power
## FREX: qing, legal, frontier, tibetan, officials, administrative, manchu, muslim, eighteenth, xinjiang
## Lift: criminal, qianlong, muslim, xinjiang, borderlands, frontiers, borderland, islamic, frontier, rebellion
## Score: criminal, qing, tibetan, frontier, legal, manchu, xinjiang, court, local, ming
## Topic 9 Top Words:
## Highest Prob: political, movement, government, communist, party, nationalist, revolution, military, national, state
## FREX: party, communist, movements, campaign, democratic, ethnic, movement, nationalist, youth, democracy
## Lift: minorities, mongolian, zedong, democratic, youth, post-war, kuomintang, campaign, party, authoritarian
## Score: mongolian, communist, democratic, party, movement, nationalist, revolution, youth, mongols, revolutionary
## Topic 10 Top Words:
## Highest Prob: song, painting, tang, dynasty, northern, ming, literati, paintings, political, yuan
## FREX: song, painting, tang, paintings, yuan, northern, architectural, artistic, southern, style
## Lift: cave, song, architectural, painting, ninth, tang, buildings, stylistic, yuan, paintings
## Score: cave, painting, song, paintings, tang, yuan, literati, artistic, ming, architectural
With this method, one can examine selected topics for a given model. In the example below, we display only topics 1, 3, and 6 in the 6-topic model. This can be used for a comparative examination of topics.
labelTopics(mod.6, topics=c(1,3,6), n=10) # complete list of top 10 words per topics 1,3,6
## Topic 1 Top Words:
## Highest Prob: modern, history, western, cultural, political, century, world, intellectual, early, twentieth
## FREX: science, intellectual, ideas, scientific, western, twentieth, thought, intellectuals, knowledge, confucianism
## Lift: analyzing, philosophy, scientific, science, jesuits, journals, historian, confucianism, university, thinkers
## Score: analyzing, intellectual, science, global, scientific, modernity, confucianism, intellectuals, philosophy, revolution
## Topic 3 Top Words:
## Highest Prob: women, social, local, political, society, government, movement, state, medical, education
## FREX: women, christian, medical, church, health, missionaries, medicine, missionary, party, women’s
## Lift: catholic, churches, church, women’s, converts, christians, health, care, christian, commoners
## Score: catholic, women, christian, medical, church, womens, health, medicine, women’s, missionary
## Topic 6 Top Words:
## Highest Prob: qing, japanese, state, relations, empire, military, states, asia, political, imperial
## FREX: asia, frontier, empire, opium, asian, korea, diplomatic, military, korean, east
## Lift: border, borderland, diplomacy, germany, maritime, mongolia, policymakers, borderlands, diplomatic, frontiers
## Score: vietnam, asia, tibetan, trade, japanese, frontier, manchuria, empire, opium, manchu
labelTopics(mod.7, topics=c(1,2,4,7), n=10) # complete list of top 10 words per topics 1,2,4,7
## Topic 1 Top Words:
## Highest Prob: modern, history, western, century, world, cultural, national, knowledge, early, twentieth
## FREX: science, intellectuals, scientific, modernity, modern, twentieth, global, knowledge, confucianism, ideas
## Lift: analyzing, science, scientific, journals, confucianism, jesuits, understandings, intellectuals, essence, modernity
## Score: analyzing, science, global, modernity, scientific, modern, intellectual, intellectuals, twentieth, confucianism
## Topic 2 Top Words:
## Highest Prob: political, social, cultural, urban, communist, city, party, revolution, culture, history
## FREX: party, hong, urban, socialist, communist, kong, city, violence, soviet, film
## Lift: censorship, cinema, film, films, kong, hong, migrants, zedong, fashion, opera
## Score: censorship, film, communist, kong, socialist, cinema, hong, urban, films, soviet
## Topic 4 Top Words:
## Highest Prob: buddhist, song, religious, dynasty, ritual, tang, period, imperial, early, culture
## FREX: buddhist, buddhism, tang, ritual, song, yuan, medieval, zhou, shang, inscriptions
## Lift: cave, cult, tombs, buddhism, buddhist, daoist, eleventh, medieval, mortuary, royal
## Score: buddhist, song, mortuary, buddhism, tang, ritual, shang, inscriptions, cave, daoist
## Topic 7 Top Words:
## Highest Prob: japanese, taiwan, economic, state, development, government, colonial, economy, japan, taiwanese
## FREX: taiwanese, manchuria, taiwan, japanese, economy, colonial, japans, taiwans, industrial, manchukuo
## Lift: developmental, industrialization, manchukuo, cost, firms, kai-shek, japans, manchuria, taiwanese, islands
## Score: developmental, taiwan, japanese, taiwanese, manchukuo, manchuria, colonial, taiwans, industrial, japans
We can further have a glimpse at highly representative documents for each topic with the ‘findThoughts’ function and plot them with ‘plotQuote’. The function will select representative documents of a given topic and display the full text (this might give best results with shorter documents). In the example below, we display representative documents for topics 2 and 5.
First, we select the documents and assign the output to variables named “thoughts2” and “thoughts5”.
thoughts2 <- findThoughts(mod.6,texts=usdiss4tk$Abstract, topics=2, n=3)$docs[[1]]# select 3 representative documents per topic 2
thoughts5 <- findThoughts(mod.6,texts=usdiss4tk$Abstract, topics=5, n=3)$docs[[1]]# select 3 representative documents per topic 5
Second, we need to split the screen to display more than one document at the same time. In the present case, we define a display with two columns. One sometimes needs to tweak the parameters in ‘mar=’ to find the best display mode.
The par function is used to set graphical parameters:
- mfrow=c(1,2): sets up the plotting area as a 1-by-2 array, meaning that the subsequent plots will be arranged in a single row with two plots side by side.
- mar=c(0,0,2,2): sets the margins on the sides of the plots. The mar parameter takes a numeric vector of the form c(bottom, left, top, right), which specifies the size of the margins in lines of text. In this case, it sets the bottom and left margins to 0, and the top and right margins to 2 lines each.
After executing this line of code, the next two plots you create will be arranged next to each other horizontally, with no bottom or left margins and small top and right margins. Make sure to restore the display to the default parameters after completing the task.
par(mfrow=c(1,2), mar=c(0,0,2,2))
Third, we display the three most representative documents in topics 2 and 5.
plotQuote(thoughts2, width=50, maxwidth=500, text.cex=0.5, main="Topic 2")
plotQuote(thoughts5, width=50, maxwidth=500, text.cex=0.5, main="Topic 5")
This line of code serves to restore the display to the default parameters.
par(mfrow=c(1,1))
Often, topics will share common terms and exhibit a degree of correlation. Topic correlation shows relations between topics based on the proportions of words they have in common. It does not work well with models with few topics, as can be seen below. In a model with few topics, the topics are more often clearly delineated, with limited overlap between the terms that they contain.
The script below provides a basic visualization for the existence or absence of correlation. In the example below, based on the 6-topic model, we see an absence of correlation.
mod6.out.corr <- topicCorr(mod.6)
plot(mod6.out.corr)
In the 10-topic model, however, several topics share the same vocabulary, although we do not know in which proportions. The graph below only manifests the existence of correlations.
mod.out.corr <- topicCorr(mod.10)
plot(mod.out.corr)
The Structural Topic Model (STM) package provides functionality to estimate correlations between topics derived from the model. It supports two primary methods for this purpose: “simple” and “huge”. The “simple” method involves applying a threshold to the covariance matrix to retain significant correlations, offering a straightforward approach to understanding topic relationships. On the other hand, the “huge” method employs a more complex, semiparametric approach, implemented via the ‘huge’ package, that is capable of handling high-dimensional data and producing a more refined understanding of the covariances between topics. The choice of method depends on the complexity of the data and the desired granularity of the correlation analysis. The following script demonstrates the use of both methods:
corrsimple6 <- topicCorr(mod.6, method = "simple", verbose = FALSE)
corrhuge <- topicCorr(mod.10, method = "huge", verbose = FALSE)
In this script, corrsimple6 calculates the topic correlations for the 6-topic model using the “simple” method, while corrhuge computes the correlations for the 10-topic model using the “huge” method. Setting verbose = FALSE suppresses additional output during the computation, streamlining the process.
par(mfrow=c(1,2), mar=c(0,0,2,2))
plot(corrsimple6, main = "Simple method")
plot(corrhuge, main = "Huge method")
The graph can be enriched by introducing measures of correlations and visually differentiating the different elements of the graph more. We shall use the 10-topic model as our basis for the graph visualization. Since producing a correlation graph sometimes generates unexpected issues, we shall follow a step-by-step procedure to make sure that the stm_corrs10 object is a valid graph object that can be processed by ggraph.
First, we extract the network from the topic model.
stm_corrs10 <- get_network(model = mod.10,
method = 'simple',
labels = paste('Topic', 1:10),
cutiso = FALSE)
With this code, all the nodes representing the topics are displayed. To display only the nodes that are correlated, change ‘cutiso = FALSE’ to ‘cutiso = TRUE’ in the call to get_network.
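For reference, here is a sketch of the same extraction with the isolated nodes dropped (this assumes the get_network function from stminsights, as used above):

```r
# Same network extraction as above, but cutiso = TRUE removes
# topic nodes that have no correlation edge to any other topic.
stm_corrs10_connected <- get_network(model = mod.10,
                                     method = 'simple',
                                     labels = paste('Topic', 1:10),
                                     cutiso = TRUE)
```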
We check the type of the object; we expect an ‘igraph’ object.
class(stm_corrs10)
## [1] "tbl_graph" "igraph"
Second, we create a minimal ggraph graph.
ggraph(stm_corrs10, layout = 'fr') +
geom_edge_link() +
geom_node_point() +
geom_node_label(aes(label = name))
We can see here that only three topics are correlated.
Third, we add measures to the edges based on weight.
ggraph(stm_corrs10, layout = 'fr') +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point(size = 4) +
geom_node_label(aes(label = name, size = props), repel = TRUE, alpha = 0.85)
Fourth, we color the edges and enlarge the node labels.
ggraph(stm_corrs10, layout = 'fr') +
geom_edge_link(
aes(edge_width = weight),
label_colour = '#fc8d62',
edge_colour = '#377eb8') +
geom_node_point(size = 4, colour = 'black') +
geom_node_label(
aes(label = name, size = props),
colour = 'black', repel = TRUE, alpha = 0.85) +
scale_size(range = c(2, 10), labels = scales::percent) +
labs(size = 1.0, edge_width = 1.0, title = "Simple method") +
theme_graph()
In the case of the 10-topic model, because it contains negative values, the ‘huge’ method does not apply.
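One can check this directly: topicCorr returns an object whose cor element holds the topic correlation matrix, so a quick inspection (a sketch, assuming mod.10 is still in memory) reveals whether negative values are present:

```r
# Inspect the range of topic correlations; negative entries indicate
# why the semiparametric 'huge' estimation is not appropriate here.
range(topicCorr(mod.10, method = "simple")$cor)
```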
Another way to explore the topics is to examine them side by side. The ‘perspectives’ argument enables us to compare topics two by two. In the example below, we compare topic 1 and topic 5 in the 6-topic model, and topic 2 and topic 6 in the 7-topic model. The closer the terms are to the middle line, the higher the degree of similarity between the two topics. This can be useful to study why two topics that seem to relate to the same issues are distinct from each other.
plot(mod.6, type="perspectives", topics=c(1, 5))
plot(mod.7, type="perspectives", topics=c(2, 6))
This line of code serves to restore the display to the default parameters
par(mfrow=c(1,1))
Word clouds provide an intuitive, though less rigorous way of visualizing word prevalence in topics. Yet they can be used to get a perspective on topics. They can also be used in publications.
First, we split the plotting area into two columns to display two word clouds side by side.
par(mfrow=c(1,2), mar=c(0,0,2,2))
Second, we use the ‘cloud’ function to compute word clouds for topic 1 and topic 5 in the 6-topic model. Usually, we display word clouds from the same model, but the stm library is flexible enough that you could display word clouds from two different models.
cloud(mod.6, topic = 1, scale = c(4, 0.4))
cloud(mod.6, topic = 5, scale = c(4, 0.4))
This line of code serves to restore the display to the default parameters
par(mfrow=c(1,1))
In the example below, we show how to display a group of four word clouds together. The first line of the script splits the plotting area into a 2-by-2 grid to hold the four word clouds.
par(mfrow=c(2,2), mar=c(0,0,4,4))
The lines of code below serve to visualize the four word clouds as a single image.
cloud(mod.6, topic = 1, scale = c(4, 0.4))
cloud(mod.6, topic = 3, scale = c(4, 0.4))
cloud(mod.6, topic = 5, scale = c(4, 0.4))
cloud(mod.6, topic = 6, scale = c(4, 0.4))
This line of code serves to restore the display to the default parameters
par(mfrow=c(1,1))
In the script above, we have provided ways to examine correlations between topics through static visualizations. The ‘toLDAvis’ function offers the possibility to examine the topics, their content (word frequency by topic and in the whole corpus), and their correlations. It is not activated in this script because interactive visualizations are not compatible with the Markdown format. To implement it, paste the code into an R script and run the line of code without the hashtag (#). We provide a single example with the 10-topic model.
#stm::toLDAvis(mod.10, doc=out$documents)
The stm::toLDAvis function takes a fitted STM object (in this case, mod.10, a model with 10 topics) and the original documents used in the STM (doc=out$documents), and transforms this information into a format that can be used with the LDAvis package. This allows for an interactive visualization where each topic is represented in a two-dimensional space based on its similarity to other topics. The visualization helps in interpreting the topics, as it shows the distribution of words within each topic and the relative sizes of the topics.
This function is particularly useful because it enables one to explore the relationships between different topics in a visually intuitive way, making it easier to understand the structure of the data and the nature of the topics extracted by the model.
### Topic proportion per year
This line of code creates a new data frame topicprop10s from topicprop10 by removing columns that are not needed for the analysis. The select(-c(...)) function is used to exclude these columns.
# Remove unwanted columns from topicprop10
topicprop10s <- topicprop10 %>% select(-c(Title, School_Name, Keywords_Ext, Year))
Next, we join the two relevant data frames by the StoreId column. The inner_join function is used to merge topicprop10s with another data frame, usdiss4tkt, based on the common column StoreId. The result is combined_data, which contains the rows that have matching StoreId values in both data frames. usdiss4tkt contains the full metadata of the dissertations corpus.
combined_data <- inner_join(topicprop10s, usdiss4tkt, by = c("StoreId" = "StoreId"))
We group and summarize the data for visualization. The group_by function groups the combined data by the Year column, while summarise(across(starts_with("Topic"), ~ mean(.x, na.rm = TRUE))) calculates the mean of all columns that start with “Topic” for each group, removing NA values (na.rm = TRUE).
topic_proportion_per_year10 <- combined_data %>%
  group_by(Year) %>%
  summarise(across(starts_with("Topic"), ~ mean(.x, na.rm = TRUE)))
We need to reshape the data frame, transforming topic_proportion_per_year10 from a wide format to a long format using pivot_longer. All columns except Year are turned into two new columns: variable and value. The variable column contains the name of the original column, and value contains the corresponding data.
vizDataFrame10y <- topic_proportion_per_year10 %>% pivot_longer(!Year, names_to = "variable", values_to = "value")
We can now plot topic proportions per year as a bar plot and examine how the prevalence of the topics changed over time. The script visualizes the data using ggplot2.
ggplot(vizDataFrame10y, aes(x=Year, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "Topic") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title="Topics over time in US dissertations",
subtitle = "Topic proportion over time",
caption = "10-topic stm model")
It creates a bar plot with years on the x-axis, topic proportions on the y-axis, and different colors for each topic.
- geom_bar(stat = "identity") indicates that the heights of the bars represent the data values directly, without any transformation.
- The scale_fill_manual function is used to manually set the colors of the bars, with alphabet(20) generating a palette of colors.
- theme(axis.text.x = element_text(angle = 90, hjust = 1)) rotates the x-axis text for better readability.
- The labs function adds labels and a title to the plot.
Depending on the number of topics, the default colors may not be optimal. It is possible to change the palette and choose a more appropriate set of colors. In this script, we use the ‘RColorBrewer’ library. To explore other color palettes, one can consult the ColorBrewer website.
The line color_palette <- brewer.pal(10, "Set3") creates a color palette using the brewer.pal function from the RColorBrewer package. The number 10 specifies how many different colors you want in the palette, and "Set3" is the name of the color scheme from which the colors are selected. One should change the number 10 to a different value if more or fewer distinct colors are needed for data visualization. For example:
- If you have 5 categories to represent and you use brewer.pal(10, "Set3"), you will get 10 colors, but the visualization will only use 5 of them.
- If you have 12 categories but only generate 10 colors, two categories will not have unique colors, which could be misleading or visually unappealing.
You should always adjust the number to match the exact number of unique colors you need for your specific visualization task.
color_palette <- brewer.pal(10, "Set3")
Finally, we can plot topic proportions per year as a bar plot with the new color palette.
ggplot(vizDataFrame10y, aes(x=Year, y=value, fill=variable)) +
geom_bar(stat = "identity") +
ylab("proportion") +
scale_fill_manual(values=color_palette, name = "Topic") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title="Topics over time in US dissertations",
subtitle = "Topic proportion over time",
caption = "10-topic stm model")
Let us review each part of the code:
1. ggplot(vizDataFrame10y, aes(x=Year, y=value, fill=variable)): initializes a ggplot object with vizDataFrame10y as the data source. aes sets the aesthetic mappings: Year on the x-axis, value on the y-axis, and variable as the fill color (which differentiates the bars based on the variable column, representing different topics).
2. geom_bar(stat = "identity"): adds bars to the plot with heights corresponding to the value column in the data frame. stat = "identity" tells ggplot that the data provided in the y aesthetic is already aggregated, so it should be used directly to determine the height of the bars.
3. ylab("proportion"): sets the label for the y-axis as “proportion.”
4. scale_fill_manual(values=color_palette, name = "Topic"): specifies the colors to use for the fill aesthetic manually, based on color_palette, and sets the legend title to “Topic.”
5. theme(axis.text.x = element_text(angle = 90, hjust = 1)): adjusts the theme of the plot, specifically the x-axis text elements. It rotates the x-axis labels by 90 degrees and justifies them so that the text is aligned with the tick marks (useful for long labels).
6. labs(title="Topics over time in US dissertations", subtitle = "Topic proportion over time", caption = "10-topic stm model"): adds labels to the plot: a main title, a subtitle, and a caption.
If one changes the source of data, there is no need to intervene on points 1 and 2. Most commonly, one will have to adjust the content in point 6 (title, subtitle, and caption).
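As an illustration, here is a sketch of the same plot with placeholder labels for a hypothetical alternative corpus; only the labs() call (point 6) changes:

```r
# Same plot as above; only the labels (point 6) are adjusted.
# The titles below are placeholders, not from the study.
ggplot(vizDataFrame10y, aes(x = Year, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  ylab("proportion") +
  scale_fill_manual(values = color_palette, name = "Topic") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Topics over time (alternative corpus)",
       subtitle = "Topic proportion over time",
       caption = "hypothetical stm model")
```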
An alternative visual representation is to plot topic proportions per year as a line plot. It is not very appropriate here due to the number of topics, but it may work for a model with a lower number of topics. We provide the script for reference.
ggplot(vizDataFrame10y, aes(x=Year, y=value, group=variable, color=variable)) +
geom_line() +
ylab("proportion") +
scale_color_manual(values = paste0(alphabet(20), "FF"), name = "Topic") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title="Topics over time in US dissertations",
subtitle = "Topic proportion over time",
caption = "10-topic stm model")
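When the number of topics makes overlaid lines hard to read, one workaround (a sketch reusing the same vizDataFrame10y object) is to facet the series so that each topic gets its own small panel:

```r
# Give each topic its own panel instead of overlaying ten lines.
ggplot(vizDataFrame10y, aes(x = Year, y = value, group = variable)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 5) +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Topics over time in US dissertations",
       caption = "10-topic stm model")
```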
### stminsights interface
For a more complete exploration of topics based on a single model or different models, we strongly recommend using the ‘run_stminsights’ function from the stminsights package. Running the function opens an R Shiny window where one can upload the saved data of the computed models. One needs to save the project data as an ‘.RData’ file and export it. After uploading this file in the stminsights interface, the data becomes available under a series of tabs, each presenting different forms of visualization, as well as the possibility to label the topics more precisely. We provide the line of script to activate ‘run_stminsights’, but we leave it inactive since the Markdown format will crash with interactive visualizations. Below we present four snapshots of the main tabs of the stminsights interface.
#run_stminsights()
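As a sketch, the ‘.RData’ file one uploads should contain the fitted models and the associated data objects. The object names below follow this script, but check the stminsights documentation for the exact objects required (notably estimateEffect outputs, if any):

```r
# Save the objects stminsights expects to find in the uploaded .RData file:
# the fitted stm models and the 'out' object returned by prepDocuments().
save(mod.6, mod.7, mod.10, out, file = "usdiss_stm.RData")
# Then, in an interactive R session, run:
# run_stminsights()
```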
# Concluding Remarks
This investigation into American doctoral dissertations on Chinese history is anchored in a dataset that includes metadata and abstracts summarizing each document’s content. The deployment of various computational techniques has unveiled the landscape of academic engagement with China’s historical narrative. By utilizing different R packages, the historiographical contributions of American universities have been methodically dissected.
The statistical and textual scrutiny of dissertation abstracts and keywords has illuminated the thematic undercurrents within scholarly research on Chinese history. The use of author-defined keywords to categorize dissertations has provided insight into their self-perceived scholarly identity. The study’s linchpin, topic modeling, employed Latent Dirichlet Allocation (LDA) to outline the primary thematic threads in the body of work. This computational method quantifies topic prominence and their interrelations, revealing subtle shifts that mark the historiography of Chinese history as framed by American academia.
The necessity of computational methods is clear in managing the extensive corpus these dissertations represent. This strategy is adaptable to a broad spectrum of topics, particularly those emerging from bibliographic database queries such as CNKI, Historical Abstracts, or even journal platforms like JSTOR or MUSE. The markdown script systematically analyzes the data through several steps:
Summarization of Main Trends: The topic modeling has revealed a wide range of research foci, from the political dynamics of ancient dynasties to the revolutions of the modern era. Trends indicate a diversification of interest over time, with early dissertations concentrating on traditional historical narratives and more recent works delving into thematic areas such as gender studies. This shift reflects a broader transformation within the field of historiography, where multi-disciplinary approaches have become increasingly prevalent.
Keyword Categorization: Keywords provided by authors offer a self-reflective glimpse of scholarly identities, serving as an authorial perspective on academic contributions and intentions. This metadata, albeit subjective, highlights the evolution of scholarly discourse and the rise of new terminologies.
Topic Modeling: The LDA topic models have served as a computational microscope, bringing into focus the thematic clusters that dominate the corpus. The most prevalent topics have revolved around the political and economic transformations in Chinese history, indicating a strong historiographical emphasis on structural changes. The inter-topic correlations have further revealed how areas such as social history and international relations have increasingly interwoven, suggesting a more interconnected approach to understanding China’s past.
The findings bear significant weight on the historiography of Chinese history, proposing that American academic institutions are not mere knowledge custodians but active narrators of the Chinese historical account. The scope of dissertations signifies a shift from Eurocentric perspectives to a nuanced comprehension that embraces indigenous viewpoints and the intricacies of China’s global interactions. This research paves the way for future inquiries, such as comparing American dissertations with those from other regions to identify global academic trends or assessing the influence of these works on the broader Chinese studies field.
However, the computational approach has its confines. Topic modeling, while potent, is algorithm-based and may not fully grasp the subtleties of human interpretation. Moreover, the focus on keywords and abstracts means that the deeper insights within full dissertation texts are not examined. Despite these limitations, the computational methods applied are invaluable for navigating historical discourse complexities, offering a replicable model for future historiographical research.
This study contributes to digital historiography by showcasing how computational analysis can refine our understanding of academic patterns and the evolution of historiographical themes. The methodologies introduced are becoming indispensable in the historian’s repertoire, fostering a profound, multifaceted comprehension of how we, as scholars, construct the past and craft the historical narrative for subsequent generations’ interpretation.